STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

Neural Information Processing Systems

While the direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded with a microphone array, sound events usually derive from visually perceptible source objects; e.g., the sound of footsteps comes from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio-visual SELD systems can detect and localize sound events using signals from a microphone array together with audio-visual correspondence. We also introduce an audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23), which consists of multichannel audio data recorded with a microphone array, video data, and spatiotemporal annotations of sound events. Sound scenes in STARSS23 are recorded under instructions that guide participants to ensure adequate activity and occurrences of target sound events. STARSS23 also provides human-annotated temporal activation labels and human-confirmed DOA labels, which are based on the tracking results of a motion capture system. Our benchmark results demonstrate the benefits of using visual object positions in audio-visual SELD tasks. The data is available at https://zenodo.org/record/7880637.
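For context on the DOA labels described above, SELD pipelines commonly convert an azimuth/elevation annotation into a Cartesian unit vector before training or evaluation. The following is a minimal sketch of that conversion; the function name and the axis convention (x front, y left, z up) are assumptions for illustration, not necessarily the exact convention used by STARSS23 tooling:

```python
import numpy as np

def doa_to_unit_vector(azimuth_deg: float, elevation_deg: float) -> np.ndarray:
    """Convert a DOA given as azimuth/elevation in degrees to a Cartesian unit vector."""
    az = np.radians(azimuth_deg)
    el = np.radians(elevation_deg)
    # Assumed convention: x points to the front, y to the left, z up.
    return np.array([
        np.cos(el) * np.cos(az),
        np.cos(el) * np.sin(az),
        np.sin(el),
    ])

# A source straight ahead at the horizon maps to the x axis.
print(doa_to_unit_vector(0.0, 0.0))
```

The inverse mapping (vector back to angles) uses `arctan2` for azimuth and `arcsin` of the z component for elevation.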


Enhanced Sound Event Localization and Detection in Real 360-degree audio-visual soundscapes

Roman, Adrian S., Balamurugan, Baladithya, Pothuganti, Rithik

arXiv.org Artificial Intelligence

This technical report details our work towards building an enhanced audio-visual sound event localization and detection (SELD) network. We build on top of the audio-only SELDnet23 model and adapt it to be audio-visual by merging both audio and video information prior to the gated recurrent unit (GRU) of the audio-only network.

For this reason, the sound localization performance strongly depends on the video content [10]. This makes models prone to erroneous SELD on frames with no audio or uncorrelated audio activity. We introduce a visual branch into the audio-only SELDnet23 baseline from the Classification of Acoustic Scenes and Events (DCASE) challenge.
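The fusion step described in this abstract (merging audio and video information prior to the GRU) can be sketched roughly as follows. All shapes, feature names, and the simple frame-wise concatenation here are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame features: audio embeddings from a CNN front end,
# and visual embeddings (e.g., encoded object positions) from a video branch.
T, AUDIO_DIM, VIDEO_DIM = 50, 128, 32
audio_feats = rng.standard_normal((T, AUDIO_DIM))
video_feats = rng.standard_normal((T, VIDEO_DIM))

# Audio-visual fusion: frame-wise concatenation before the recurrent layer.
fused = np.concatenate([audio_feats, video_feats], axis=-1)  # shape (T, 160)

# A toy single-layer GRU pass over the fused sequence (hidden size 64,
# random weights, no biases) to show where the merged features flow.
H = 64
Wz, Wr, Wh = (rng.standard_normal((AUDIO_DIM + VIDEO_DIM + H, H)) * 0.01
              for _ in range(3))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

h = np.zeros(H)
for x in fused:
    xh = np.concatenate([x, h])
    z = sigmoid(xh @ Wz)                                  # update gate
    r = sigmoid(xh @ Wr)                                  # reset gate
    h_tilde = np.tanh(np.concatenate([x, r * h]) @ Wh)    # candidate state
    h = (1 - z) * h + z * h_tilde
```

In a real SELD network the GRU output would feed the detection and DOA regression heads; the point of the sketch is only that the recurrent layer sees a joint audio-visual representation rather than audio features alone.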